Author: Goran S. Milovanovic, Data Science Serbia
Notebook: 01/30/2017, Belgrade, Serbia
This notebook accompanied the first Open Data R Meetup in Belgrade, organized by Data Science Serbia in Startit Center, Savska 5, Belgrade, 01/30/2017. The event took place to introduce the Data Science community in Serbia with the Open Data initative people, and celebrate the completion of the 2nd Introduction to R for Data Science course that Data Science Serbia has provided in Startit, Belgrade. As of now, more than 45 people have learned - or are currently learning - to code in R for free, thanks to our efforts to develop the Introduction to R for Data Science course.
The notebook focuses on an exploratory analysis of the test open data set of Traffic Accidents in Belgrade in 2015, provided at the Open Data Portal of the Republic of Serbia that is currently under development. The data set was kindly provided to the Open Data Portal by the Republic of Serbia Ministry of Interior. Many more open data sets will be indexed and uploaded in the forthcoming weeks and months.
Besides focusing on the exploration and visualization of this test data set, we demonstrate the basic usage of {weatherData} to fetch historical weather data to R, {wbstats} to access the rich World Data Bank time series, and {ISOcodes} packages in R.
Some exploratory modeling (Negative Binomial Regression with glm.nb() and Ordinal Logistic Regression with clm() from {ordinal}) is exercised merely to assess the prima facie effects of the most influential factors.
Disclaimer. The Open Data Portal of the Republic of Serbia is a young initiative that is currently under development. Neither the owner of this GitHub account as an individual, or Data Science Serbia as an organization, hold any responsibility for the changes in the URLs of the data sets, or the changes in the content of the data sets published on the Open Data Portal of the Republic of Serbia. The results of the exploratory analyses and statistical models that are presented on this GitHub account are developed for illustrative purposes only, having in mind the goal of popularization of Open Data exclusively. The owner of this GitHub account strongly advises to consult him (e-mail: goran.s.milovanovic@gmail.com and Data Science Serbia before using the results presented here in public debate or media, and/or for any purposes other than motivating the usage and development of Open Data.
Load libraries + raw data:
rm(list = ls())
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(ggplot2)
library(ggmap)
### --- Working Directory
wDir <- '../TrafficAccidentsBGD2015'
setwd(wDir)
### --- Load Raw Data Set
fileLoc <- 'OPENDATA_SNEZGODE exc.csv'
rawData <- read.csv(fileLoc,
header = T,
check.names = F,
stringsAsFactors = F)
### --- Inspect Data Set
dim(rawData)
## [1] 12873 11
Take a sneak peek at the data set:
glimpse(rawData)
## Observations: 12,873
## Variables: 11
## $ NEZG_ID <int> 1049390, 1049255, 1041851, 1041992, 1042207, 1042241...
## $ VREME_NEZ <chr> "30.01.2015,17:45", "29.01.2015,15:30", "13.01.2015,...
## $ GPS_X <int> 7460247, 7458383, 7461071, 7457669, 7454774, 7459982...
## $ GPS_Y <int> 4957951, 4961042, 4963801, 4963440, 4962700, 4959325...
## $ VRSTA_NEZ <chr> "Sa povredjenim", "Sa mat.stetom", "Sa povredjenim",...
## $ NAZIV_TIP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ NAZIV_DET <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ _COL8 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ _COL9 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ WGS_X <dbl> 20.49236, 20.46856, 20.50232, 20.45934, 20.42281, 20...
## $ WGS_Y <dbl> 44.76500, 44.79271, 44.81769, 44.81425, 44.80741, 44...
What are we looking at:
bgdMap <- get_map(location = 'Belgrade',
maptype = "roadmap")
ggmap(bgdMap,
extent = "device") +
geom_point(data = rawData,
aes(x = WGS_X, y = WGS_Y),
size = .25,
color = "blue",
alpha = .35)
## Warning: Removed 220 rows containing missing values (geom_point).
With 220 points found outisde of the Belgrade Google Map area we’re already certain that some measurement error is present in the geolocalisation data. We’ll take care of it later.
Let’s take a quick look at the city core in a 2D density plot with geom_density2d from {ggplot2}:
bgdMap <- get_map(location = 'Kneza Milosa, Belgrade',
maptype = "roadmap",
zoom = 16)
ggmap(bgdMap,
extent = "device") +
geom_density2d(data = rawData,
aes(x = WGS_X, y = WGS_Y),
size = .1) +
stat_density2d(data = rawData,
aes(x = WGS_X, y = WGS_Y,
fill = ..level..,
alpha = ..level..),
size = 0.01,
bins = 16,
geom = "polygon") +
scale_fill_gradient(low = "white", high = "blue") +
scale_alpha(range = c(0, 0.3), guide = FALSE) +
theme(legend.position="none")
## Warning: Removed 12341 rows containing non-finite values (stat_density2d).
## Warning: Removed 12341 rows containing non-finite values (stat_density2d).
bgdMap <- get_map(location = 'Mostar, Belgrade',
maptype = "roadmap",
zoom = 16)
ggmap(bgdMap,
extent = "device") +
geom_density2d(data = rawData,
aes(x = WGS_X, y = WGS_Y),
size = .1) +
stat_density2d(data = rawData,
aes(x = WGS_X, y = WGS_Y,
fill = ..level..,
alpha = ..level..),
size = 0.01,
bins = 16,
geom = "polygon") +
scale_fill_gradient(low = "white", high = "blue") +
scale_alpha(range = c(0, 0.3), guide = FALSE) +
theme(legend.position="none")
## Warning: Removed 12563 rows containing non-finite values (stat_density2d).
## Warning: Removed 12563 rows containing non-finite values (stat_density2d).
Let’s start figuring out the variables:
# - VRSTA_NEZ
unique(rawData$VRSTA_NEZ)
## [1] "Sa povredjenim" "Sa mat.stetom" "Sa poginulim"
table(rawData$VRSTA_NEZ)
##
## Sa mat.stetom Sa poginulim Sa povredjenim
## 9484 80 3309
# - NAZIV_TIP
unique(rawData$NAZIV_TIP)
## [1] ""
## [2] "SN SA NAJMANjE DVA VOZILA ??? SKRETANjE ILI PRELAZ"
## [3] "SN SA NAJMANjE DVA VOZILA ??? BEZ SKRETANjA"
## [4] "SN SA JEDNIM VOZILOM"
## [5] "SN SA PARKIRANIM VOZILIMA"
## [6] "SN SA PE??ACIMA"
table(rawData$NAZIV_TIP)
##
##
## 12634
## SN SA JEDNIM VOZILOM
## 38
## SN SA NAJMANjE DVA VOZILA ??? BEZ SKRETANjA
## 89
## SN SA NAJMANjE DVA VOZILA ??? SKRETANjE ILI PRELAZ
## 62
## SN SA PARKIRANIM VOZILIMA
## 37
## SN SA PE??ACIMA
## 13
# - NAZIV_DET
head(unique(rawData$NAZIV_DET))
## [1] ""
## [2] "Najmanje dva vozila koja se kre??u istim putem u suprotnim smerovima uz skretanje ulevo ispred drugog vozila"
## [3] "Ostale nezgode sa u??e????em najmanje dva vozila bez skretanja (bez podataka o smeru)"
## [4] "Ostale nezgode sa najmanje dva vozila koja se kre??u istim putem u istom smeru uz skretanje"
## [5] "Ostale nezgode sa najmanje dva vozila ??? suprotni smerovi bez skretanja"
## [6] "Najmanje dva vozila koja se kre??u u istom smeru ??? sustizanje"
# - NAZIV_TIP vs. VRSTA_NEZ
kable(table(rawData$VRSTA_NEZ, rawData$NAZIV_TIP))
| SN SA JEDNIM VOZILOM | SN SA NAJMANjE DVA VOZILA ??? BEZ SKRETANjA | SN SA NAJMANjE DVA VOZILA ??? SKRETANjE ILI PRELAZ | SN SA PARKIRANIM VOZILIMA | SN SA PE??ACIMA | ||
|---|---|---|---|---|---|---|
| Sa mat.stetom | 9307 | 27 | 62 | 52 | 36 | 0 |
| Sa poginulim | 76 | 1 | 2 | 0 | 0 | 1 |
| Sa povredjenim | 3251 | 10 | 25 | 10 | 1 | 12 |
Separate Date from Time:
# - Separate Date and Time
rawData <- separate(data = rawData,
col = VREME_NEZ,
into = c('Date', 'Time'),
sep = ',',
remove = F)
Separate Date to Day, Month, and Year:
# - Date --> Day, Month, Year
rawData <- separate(data = rawData,
col = Date,
into = c('Day', 'Month', 'Year'),
sep = '\\.',
remove = F)
Separate Time to Hour, Minute:
# - Time --> Hour, Minute
rawData <- separate(data = rawData,
col = Time,
into = c('Hour', 'Minute'),
sep = ':',
remove = F)
Add a new Date-Time format:
# - New Date-Time format:
rawData$DateTime <- paste(
paste(rawData$Month,
rawData$Day,
rawData$Year,
sep = '/'),
paste(rawData$Hour,
rawData$Min,
sep = ":"),
sep = ", "
)
And check:
# - Check Data Set:
kable(table(rawData$Year, rawData$Month))
| 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2014 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2015 | 1042 | 1063 | 1168 | 1100 | 1206 | 1158 | 1104 | 1119 | 1311 | 1417 | 1178 | 1 |
There are some data points labeled by ‘2013’ and ‘2014’; probably errors; correct:
# - Year 2013, 2014: probably errors; correct
rawData[rawData$Year == '2014', 'Year'] <- '2015'
rawData[rawData$Year == '2013', 'Year'] <- '2015'
# - Check Data Set:
kable(table(rawData$Year, rawData$Month))
| 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015 | 1042 | 1064 | 1168 | 1101 | 1207 | 1158 | 1105 | 1120 | 1311 | 1418 | 1178 | 1 |
Very few data points for December; get rid of it:
# - December --> too few data points; drop:
rawData <- rawData[-which(rawData$Month == '12'), ]
And check:
# - Check Data Set:
kable(table(rawData$Year, rawData$Month))
| 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015 | 1042 | 1064 | 1168 | 1101 | 1207 | 1158 | 1105 | 1120 | 1311 | 1418 | 1178 |
What’s this:
# - Outcome: from VRSTA_NEZ
unique(rawData$VRSTA_NEZ)
## [1] "Sa povredjenim" "Sa mat.stetom" "Sa poginulim"
Recode:
rawData$Outcome <- factor(rawData$VRSTA_NEZ)
levels(rawData$Outcome) <- c('Damage', 'Death','Injury')
levels(rawData$Outcome)
## [1] "Damage" "Death" "Injury"
Turn into an ordered factor (possible use: ordinal logistic)
# - Outcome as ordered factor
rawData$Outcome <- factor(rawData$Outcome,
levels = c('Damage', 'Injury', 'Death'),
ordered = T)
What is this + recode:
# - Type: from NAZIV_TIP
unique(rawData$NAZIV_TIP)
## [1] ""
## [2] "SN SA NAJMANjE DVA VOZILA ??? SKRETANjE ILI PRELAZ"
## [3] "SN SA NAJMANjE DVA VOZILA ??? BEZ SKRETANjA"
## [4] "SN SA JEDNIM VOZILOM"
## [5] "SN SA PARKIRANIM VOZILIMA"
## [6] "SN SA PE??ACIMA"
rawData$Type <- factor(rawData$NAZIV_TIP)
levels(rawData$Type) <- c('None',
'2VehTurnOrCross',
'2VehNoTurn',
'1Veh',
'parkedVeh',
'pedestrian')
levels(rawData$Type)
## [1] "None" "2VehTurnOrCross" "2VehNoTurn" "1Veh"
## [5] "parkedVeh" "pedestrian"
How many out of 12K+ recorded accidents were categorized:
# - how many accidents were categorized?
length(rawData$Type) - sum(rawData$Type == 'None')
## [1] 239
Geography:
# - Lat, Lon
rawData$Lat <- rawData$WGS_Y
rawData$Lon <- rawData$WGS_X
# - Check Geographical Data
# - Belgrade is on: Coordinates: 44°49′N 20°28′E
# - source: https://en.wikipedia.org/wiki/Belgrade
range(rawData$Lon)
## [1] 19.23912 22.80758
range(rawData$Lat)
## [1] 42.02545 45.11274
Is this cool? Probably not. Keep only data points between \(-3SD\) and \(+3SD\). N.B. This is not the right procedure to clean-up the data set. The correct procedure would involve having a vector map of Belgrade and then keeping only data points strictly found in the administratively defined region of interest.
# - filter
rawData <- rawData %>%
filter (Lon < mean(Lon) + 3*sd(Lon),
Lon > mean(Lon) -3*sd(Lon),
Lat < mean(Lat) + 3*sd(Lat),
Lat > mean(Lat) - 3*sd(Lat))
Sort:
# - Sort Data Set
rawData <- rawData[order(rawData$Day,
rawData$Month,
rawData$Hour,
rawData$Minute), ]
dataSet will be the working data set:
### --- Working Data Set
dataSet <- rawData[, c('DateTime', 'Day', 'Month', 'Year',
'Time', 'Hour', 'Minute', 'Lat', 'Lon',
'Type', 'Outcome')]
rm(rawData)
Save:
### --- Save Working Data Set
### --- Working Directory
wDir <- '../TrafficAccidentsBGD2015'
setwd(wDir)
write.csv(dataSet, file = 'MUP2015_TrafficAccidentsBGD.csv')
Now, let’s visualize properly w. {leaflet}:
The density of traffic accidents is dynamically represents by markers in the Leaflet map. Click or zoom in too discover fine-grained information.
library(leaflet)
# - add an HTML descriptive column to dataSet
dataSet$Description <- paste(
'<b>Date: </b>', paste(dataSet$Month, dataSet$Day, dataSet$Year, sep = "/"), '<br>',
'<b>Time: </b>', paste(dataSet$Hour, dataSet$Minute, sep = ":"), '<br>',
'<b>Outcome: </b>', dataSet$Outcome,
sep = '')
accMap <- leaflet() %>%
addTiles() %>%
addMarkers(lng = dataSet$Lon,
lat= dataSet$Lat,
popup = dataSet$Description,
clusterOptions = markerClusterOptions())
accMap
The severity of the traffic accident outcomes is represents by marker size and colors: yellow stands for ‘Damage’, orange for ‘Injury’, red for ‘Death’. Click or zoom in too discover fine-grained information.
# - Add Outcome Color
dataSet$OutcomeColor <- dataSet$Outcome
dataSet$OutcomeColor <- recode(dataSet$OutcomeColor,
Damage = 'Yellow',
Injury = 'DarkOrange',
Death = 'Red')
# - Add Outcome Warning (Marker Size)
dataSet$OutcomeWarning <- dataSet$Outcome
dataSet$OutcomeWarning <- recode(dataSet$OutcomeWarning,
Damage = '10',
Injury = '15',
Death = '20')
dataSet$OutcomeWarning <- as.numeric(dataSet$OutcomeWarning)
accMap <- leaflet() %>%
addTiles(urlTemplate = "http://{s}.tiles.wmflabs.org/bw-mapnik/{z}/{x}/{y}.png",
attribution = '© <a href="http://www.openstreetmap.org/copyright">OpenStreetMap</a>') %>%
addCircleMarkers(lng = dataSet$Lon,
lat= dataSet$Lat,
radius = dataSet$OutcomeWarning*2,
popup = dataSet$Description,
fill = T,
fillColor = dataSet$OutcomeColor,
stroke = F,
fillOpacity = .5)
accMap
What was the probability of a traffic accident resulting in injury in Belgrade, 2015? N.B. That is the conditional probability, \(P(Injury|Road Accident)\):
outcomes <- table(dataSet$Outcome)
pInjury <- outcomes['Injury']/sum(outcomes)
pInjury
## Injury
## 0.25296
That is some 25%:
pInjury*100
## Injury
## 25.296
What was the probability of a traffic accident with fatal consequences in Belgrade, 2015? N.B. That is the conditional probability, \(P(Fatal|Road Accident)\):
outcomes <- table(dataSet$Outcome)
pDeath <- unname(outcomes['Death']/sum(outcomes))
pDeath
## [1] 0.00488
which is less than a half percent:
pDeath*100
## [1] 0.488
Reminder: We do not know what is the probability of being engaged in a traffic accident, so do not rush with associating emotional responses to these probabilities…
Accidents by months:
byMonth <- dataSet %>%
group_by(Month) %>%
summarise(Accidents = n())
byMonth$Month <- as.numeric(byMonth$Month)
# byMonth$Month <- factor(byMonth$Month, levels <- byMonth$Month)
ggplot(byMonth, aes(x = Month, y = Accidents)) +
geom_line(color = "blue", size = .5) +
geom_point(color = "blue", size = 1.5) +
geom_point(color = "white", size = 1) +
scale_x_continuous(breaks = byMonth$Month,
labels = month.abb[as.numeric(byMonth$Month)]) +
ggtitle('Belgrade, 2015: Traffic Accidents by Month') +
ylim(1000, max(byMonth$Accidents)+100) +
theme_bw() +
theme(plot.title = element_text(size = 11))
Accidents by hour:
byHour <- dataSet %>%
group_by(Hour) %>%
summarise(Accidents = n())
byHour$HourData <- as.numeric(byHour$Hour)
ggplot(byHour, aes(y = Accidents, x = HourData)) +
geom_line(color = "blue", size = .5) +
geom_point(color = "blue", size = 1.5) +
geom_point(color = "white", size = 1) +
scale_x_continuous(breaks = byHour$HourData-.5,
labels = byHour$Hour) +
ggtitle('Belgrade, 2015: Traffic Accidents by Hour') +
xlab('Hour') +
theme_bw() +
theme(plot.title = element_text(size = 11))
Accident outcome vs. Month:
dataSet$MonthLabel <- month.abb[as.numeric(dataSet$Month)]
plot(table(dataSet$MonthLabel, dataSet$Outcome),
main = 'Accident Outcomes per Month',
col = "gold")
No pattern seems to be present:
testAccMonth <- as.matrix(table(dataSet$MonthLabel, dataSet$Outcome))
kable(testAccMonth)
| Damage | Injury | Death | |
|---|---|---|---|
| Apr | 816 | 265 | 3 |
| Aug | 788 | 276 | 4 |
| Feb | 804 | 243 | 6 |
| Jan | 767 | 256 | 5 |
| Jul | 782 | 283 | 7 |
| Jun | 802 | 322 | 6 |
| Mar | 859 | 298 | 3 |
| May | 859 | 311 | 8 |
| Nov | 870 | 252 | 4 |
| Oct | 1017 | 333 | 4 |
| Sep | 913 | 323 | 11 |
To confirm, \({\chi}^2\)-test for contingency tables:
chisq.test(testAccMonth,
simulate.p.value = T)
Pearson's Chi-squared test with simulated p-value (based on 2000
replicates)
data: testAccMonth X-squared = 26.791, df = NA, p-value = 0.1354
When do the fatal accidents take place?
fatalSet <- dataSet %>%
filter(Outcome == 'Death') %>%
group_by(Hour) %>%
summarise(Accidents = n()) %>%
t()
colnames(fatalSet) <- fatalSet[1,]
fatalSet <- fatalSet[-1,]
kable(t(fatalSet), caption = 'Fatal Accidents by Hour:')
| 00 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 4 | 2 | 3 | 1 | 3 | 4 | 2 | 3 | 3 | 6 | 2 | 3 | 2 | 8 | 1 | 1 | 2 | 4 | 3 |
Is there a pattern in the hourly distribution of the accident outcome?
byHourSet <- dataSet %>%
group_by(Hour, Outcome) %>%
tally() %>%
mutate(Percent = round(n/sum(n)*100,2))
ggplot(data = byHourSet,
aes(x = Hour, y = Percent, color = Outcome, fill = Outcome)) +
geom_bar(stat = "identity", position = "stack", width = .5) +
scale_fill_brewer(palette = "Blues") +
scale_color_brewer(palette = "Blues") +
theme_bw()
Any indication of a relationship? We already know that there will missing data for fatal accidents:
byHourSet <- dataSet %>%
filter(!(Outcome == 'Death')) %>%
group_by(Hour, Outcome) %>%
tally() %>%
spread(key = Outcome, value = n) %>%
ungroup() %>% dplyr::select(-Hour)
chisq.test(byHourSet,
simulate.p.value = T)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: byHourSet
## X-squared = 49.848, df = NA, p-value = 0.0004998
Hm.
byHourSet <- dataSet %>%
group_by(Hour, Outcome) %>%
tally() %>%
mutate(Percent = round(n/sum(n)*100,2)) %>%
filter(Outcome == 'Injury')
ggplot(data = byHourSet,
aes(x = as.numeric(Hour), y = Percent)) +
geom_path(color = 'blue') +
geom_point(size = 1.5, color = 'blue') +
geom_point(size = 1, color = 'white') +
xlab('Hour') + ylim(20,35) +
ggtitle('Percent of Accidents w. Injuries per Hour') +
scale_x_continuous(breaks = as.numeric(byHourSet$Hour),
labels = as.character(0:23)) +
theme_bw()
Let’s inspect the distribution of traffic accidents per day:
dataSet <- unite(data = dataSet,
col = Date,
Year, Month, Day,
sep = "-")
Check whether there are any days with no accidents that are not in December 2015. for which we have no data:
distData <- dataSet %>%
group_by(Date) %>%
summarise(Frequency = n())
distData$Date <- as.Date(distData$Date)
dates2015 <- seq(as.Date("2015/1/1"), as.Date("2015/12/31"), "days")
w <- which(dates2015 %in% distData$Date)
obs <- rep(0, length(dates2015))
obs[w] <- distData$Frequency
distData <- data.frame(
Date = dates2015,
Frequency = obs)
length(which(distData$Frequency == 0)) # December data, right
## [1] 31
which(distData$Frequency == 0) # December data, right
## [1] 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351
## [18] 352 353 354 355 356 357 358 359 360 361 362 363 364 365
distData <- distData[distData$Frequency > 0, ]
ggplot(distData, aes(x = Frequency)) +
geom_bar(stat = 'count', fill = "blue", alpha = .35, width = .5) +
xlab("Number of Accidents Per Day") + ylab("Frequency") +
ggtitle('Belgrade, 2015: Traffic Accidents Daily Frequency') +
theme_bw()
Could this be a Poisson distribution?
sampleStats <- lapply(1:10000, function(x) {
l <- list()
s <- sample(distData$Frequency, 50, replace = T)
l$mean <- mean(s)
l$var <- var(s)
l
})
meanStats <- sapply(sampleStats, function(x) x$mean)
varStats <- sapply(sampleStats, function(x) x$var)
poissonFrame <- data.frame(mean = meanStats,
var = varStats)
ggplot(poissonFrame, aes(x = mean, y = var)) +
geom_point(size = .1, color = "black") +
geom_smooth(method = "lm", size = .25, se = T, color = "blue", alpha = .35) +
theme_bw()
I don’t think so.
Is there anything that could help us predict the number of accidents per day? Here’s a proposal: let’ grab the weather data for Belgrade, 2015/01/01 - 2015/11/01, and see if there’s anything potentially useful in the data set:
(NOTE: bgdWeather2015.csv was obtained from the following lines that do not evaluate in this version of the notebook; the data are already prepared and stored locally.)
library(weatherData)
checkBGD <- checkSummarizedDataAvailability(station_id = 'LYBE',
start_date = '2015-01-01',
end_date = '2015-11-30',
station_type = 'id')
## Retrieving from: http://www.wunderground.com/history/airport/LYBE/2015/1/1/CustomHistory.html?dayend=30&monthend=11&yearend=2015&req_city=NA&req_state=NA&req_statename=NA&format=1
## URL does not seem to exist: http://www.wunderground.com/history/airport/LYBE/2015/1/1/CustomHistory.html?dayend=30&monthend=11&yearend=2015&req_city=NA&req_state=NA&req_statename=NA&format=1
## The original error message:
## Error in readLines(u): cannot open the connection
After checking the reported URL in the browser, it turns out the data are there in fact:
bgd2015URL <- 'https://www.wunderground.com/history/airport/LYBE/2015/1/1/CustomHistory.html?dayend=30&monthend=11&yearend=2015&req_city=NA&req_state=NA&req_statename=NA&format=1'
bgdWeather2015 <- read.csv(bgd2015URL,
header = T,
stringsAsFactors = F,
check.names = F)
# clean-up a bit:
colnames(bgdWeather2015) <- gsub("<br />","", colnames(bgdWeather2015), fixed = T)
bgdWeather2015$WindDirDegrees <- gsub("<br />","", bgdWeather2015$WindDirDegrees, fixed = T)
kable(head(bgdWeather2015))
Let’s clean up this, CloudCover first:
table(bgdWeather2015$CloudCover)
bgdWeather2015$CloudCover[which(is.na(bgdWeather2015$CloudCover))] <- 0
Events:
table(bgdWeather2015$Events)
bgdWeather2015$Events[!grepl("^[[:alpha:]]", bgdWeather2015$Events)] <- 'None'
Max Gust SpeedMPH goes to Yes/No:
table(bgdWeather2015$`Max Gust SpeedMPH`)
bgdWeather2015$WindGust <- ifelse(is.na(bgdWeather2015$`Max Gust SpeedMPH`), 'No', 'Yes')
Fix the CET column to match our Date (YYYY-MM-DD format):
bgdWeather2015$CET <- sapply(bgdWeather2015$CET, function(x) {
x <- strsplit(x, split = '-', fixed = T)
if (nchar(x[[1]][2])<2) {
x[[1]][2] <- paste0('0', x[[1]][2])
}
if (nchar(x[[1]][3])<2) {
x[[1]][3] <- paste0('0', x[[1]][3])
}
return(paste(unlist(x), collapse = "-"))
})
And save:
write.csv(bgdWeather2015, 'bgdWeather2015.csv')
(NOTE: loading bgdWeather2015.csv here). Now left join bgdWeather2015 to distData and prepare for modeling:
bgdWeather2015 <- read.csv('bgdWeather2015.csv',
header = T,
stringsAsFactors = F,
check.names = F,
row.names = 1)
bgdWeather2015$CET <- as.Date(bgdWeather2015$CET)
modelSet <- left_join(distData, bgdWeather2015,
by = c("Date" = "CET"))
modelSet$`Max Gust SpeedMPH` <- NULL
modelSet$Events <- factor(modelSet$Events)
levels(modelSet$Events)
## [1] "Fog" "Fog-Rain"
## [3] "Fog-Rain-Thunderstorm" "None"
## [5] "Rain" "Rain-Hail-Thunderstorm"
## [7] "Rain-Snow" "Rain-Thunderstorm"
## [9] "Snow" "Thunderstorm"
levels(modelSet$Events) <- c('None', setdiff(levels(modelSet$Events), 'None'))
levels(modelSet$WindGust) <- c('No', 'Yes')
Remove spaces in column names:
colnames(modelSet) <- gsub("\\s+", "", colnames(modelSet))
Fix $WindDirDegrees which is a character:
modelSet$WindDirDegrees <- as.numeric(modelSet$WindDirDegrees)
and save:
write.csv(modelSet, 'modelDataSet.csv')
Add some new features:
modelSet$DaySeverity <- cut(modelSet$Frequency,
breaks = quantile(modelSet$Frequency),
right = T,
include.lowest = T,
labels = as.numeric(1:4))
table(modelSet$DaySeverity)
##
## 1 2 3 4
## 89 85 81 79
modelSet$DaySeverity becomes an ordered factor:
modelSet$DaySeverity <- factor(modelSet$DaySeverity, ordered = T)
tail(modelSet$DaySeverity)
## [1] 1 4 2 1 1 1
## Levels: 1 < 2 < 3 < 4
Introduce DayofWeek feature:
modelSet$DayofWeek <- factor(weekdays(modelSet$Date))
levels(modelSet$DayofWeek) <- c('Monday','Tuesday','Wednesday',
'Thursday', 'Friday', 'Saturday',
'Sunday')
Inspect the effect of DayofWeek:
dayFrame <- modelSet %>%
group_by(DayofWeek) %>%
summarise(Frequency = sum(Frequency)) %>%
arrange(desc(Frequency))
kable(dayFrame)
| DayofWeek | Frequency |
|---|---|
| Monday | 2055 |
| Friday | 1982 |
| Tuesday | 1909 |
| Saturday | 1882 |
| Sunday | 1827 |
| Wednesday | 1607 |
| Thursday | 1238 |
Relevel DayofWeek:
modelSet$DayofWeek <- factor(modelSet$DayofWeek,
levels = dayFrame$DayofWeek[order(dayFrame$Frequency)])
levels(modelSet$DayofWeek)
## [1] "Thursday" "Wednesday" "Sunday" "Saturday" "Tuesday" "Friday"
## [7] "Monday"
Inspect Events:
eventFrame <- modelSet %>%
group_by(Events) %>%
summarise(Frequency = sum(Frequency)) %>%
arrange(desc(Frequency))
kable(eventFrame)
| Events | Frequency |
|---|---|
| Fog-Rain-Thunderstorm | 6079 |
| Rain | 2850 |
| None | 1312 |
| Rain-Thunderstorm | 705 |
| Fog | 535 |
| Rain-Snow | 424 |
| Snow | 248 |
| Thunderstorm | 190 |
| Fog-Rain | 80 |
| Rain-Hail-Thunderstorm | 77 |
Relevel Events:
modelSet$Events <- factor(modelSet$Events,
levels = eventFrame$Events[order(eventFrame$Frequency)])
levels(modelSet$Events)
## [1] "Rain-Hail-Thunderstorm" "Fog-Rain"
## [3] "Thunderstorm" "Snow"
## [5] "Rain-Snow" "Fog"
## [7] "Rain-Thunderstorm" "None"
## [9] "Rain" "Fog-Rain-Thunderstorm"
# - model: Poisson GLM
mData <- modelSet %>%
dplyr::select(-Date, -starts_with("Max"), -starts_with("Min"), -DaySeverity)
mData %>%
group_by(DayofWeek) %>%
summarise(Mean = round(mean(Frequency),2),
Var = round(var(Frequency),2)) %>%
kable()
| DayofWeek | Mean | Var |
|---|---|---|
| Thursday | 25.79 | 48.17 |
| Wednesday | 33.48 | 65.49 |
| Sunday | 38.87 | 77.98 |
| Saturday | 40.04 | 58.00 |
| Tuesday | 39.77 | 102.78 |
| Friday | 41.29 | 113.53 |
| Monday | 42.81 | 86.67 |
Overdispersion present; use Negative Binomial Regression in place of Poisson for count data:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
mFit <- glm.nb(Frequency ~ .,
data = mData)
# - inspect model
summary(mFit)
##
## Call:
## glm.nb(formula = Frequency ~ ., data = mData, init.theta = 43.86777439,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4603 -0.6858 -0.0353 0.5739 4.0627
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.2056174 2.0956883 2.007 0.04477 *
## MeanTemperatureF 0.0008798 0.0083022 0.106 0.91560
## MeanDewPointF 0.0026903 0.0090130 0.298 0.76533
## MeanHumidity 0.0041437 0.0040686 1.018 0.30845
## MeanSeaLevelPressureIn -0.0481397 0.0673899 -0.714 0.47501
## MeanVisibilityMiles -0.0044484 0.0089429 -0.497 0.61889
## MeanWindSpeedMPH 0.0091200 0.0044751 2.038 0.04156 *
## PrecipitationIn -0.0236448 0.0084898 -2.785 0.00535 **
## CloudCover -0.0172911 0.0103561 -1.670 0.09499 .
## EventsFog-Rain 0.3019344 0.2255759 1.339 0.18073
## EventsThunderstorm 0.0888676 0.1906498 0.466 0.64112
## EventsSnow 0.2250424 0.1870251 1.203 0.22887
## EventsRain-Snow 0.1188256 0.1788000 0.665 0.50632
## EventsFog 0.0398455 0.1722310 0.231 0.81704
## EventsRain-Thunderstorm 0.0170324 0.1675080 0.102 0.91901
## EventsNone 0.1351256 0.1652182 0.818 0.41344
## EventsRain 0.0577351 0.1615937 0.357 0.72088
## EventsFog-Rain-Thunderstorm 0.0620189 0.1620782 0.383 0.70198
## WindDirDegrees -0.0001668 0.0001361 -1.225 0.22054
## WindGustYes 0.0527579 0.0517900 1.019 0.30835
## DayofWeekWednesday 0.2563737 0.0496790 5.161 2.46e-07 ***
## DayofWeekSunday 0.4164511 0.0493509 8.439 < 2e-16 ***
## DayofWeekSaturday 0.4532986 0.0488916 9.271 < 2e-16 ***
## DayofWeekTuesday 0.4433796 0.0494880 8.959 < 2e-16 ***
## DayofWeekFriday 0.4662585 0.0490481 9.506 < 2e-16 ***
## DayofWeekMonday 0.5058560 0.0487037 10.386 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(43.8678) family taken to be 1)
##
## Null deviance: 546.63 on 333 degrees of freedom
## Residual deviance: 344.91 on 308 degrees of freedom
## AIC: 2413.7
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 43.87
## Std. Err.: 7.56
##
## 2 x log-likelihood: -2359.659
Predict:
predictFrame <- data.frame(Observation = rep(1:length(mData$Frequency), 2),
Value = c(predict(mFit, type = "response"),
mData$Frequency),
Type = c(rep("Predicted", length(mData$Frequency)),
rep("Observed", length(mData$Frequency))),
stringsAsFactors = F)
ggplot(predictFrame, aes(x = Observation, y = Value, group = Type, color = Type)) +
geom_line(size = .15) +
scale_color_manual(values = c("lightblue", "black")) +
theme_bw() +
theme(legend.key = element_blank()) +
theme(legend.title = element_blank()) +
ggtitle("GLM Negative Binomial: Predicted vs. Observed")
predictFrame <- spread(predictFrame,
key = Type,
value = Value)
ggplot(predictFrame, aes(x = Predicted, y = Observed)) +
geom_point(size = .5, color = "blue") +
theme_bw() +
ggtitle("GLM Negative Binomial: Predicted vs. Observed")
# - coefficients
mCoeff <- as.data.frame(summary(mFit)$coefficients)
w <- which(mCoeff$`Pr(>|z|)` < .05)
rownames(mCoeff)[w]
## [1] "(Intercept)" "MeanWindSpeedMPH" "PrecipitationIn"
## [4] "DayofWeekWednesday" "DayofWeekSunday" "DayofWeekSaturday"
## [7] "DayofWeekTuesday" "DayofWeekFriday" "DayofWeekMonday"
w <- setdiff(w, 1) # - supress the intercept from output
important <- data.frame(predictor = rownames(mCoeff)[w],
coefficient = round(exp(mCoeff$Estimate)[w], 2),
stringsAsFactors = F)
kable(important[order(-important$coefficient), ])
| predictor | coefficient | |
|---|---|---|
| 8 | DayofWeekMonday | 1.66 |
| 7 | DayofWeekFriday | 1.59 |
| 5 | DayofWeekSaturday | 1.57 |
| 6 | DayofWeekTuesday | 1.56 |
| 4 | DayofWeekSunday | 1.52 |
| 3 | DayofWeekWednesday | 1.29 |
| 1 | MeanWindSpeedMPH | 1.01 |
| 2 | PrecipitationIn | 0.98 |
mData <- modelSet %>%
dplyr::select(-Date, -starts_with("Max"), -starts_with("Min"), -Frequency)
library(ordinal)
##
## Attaching package: 'ordinal'
## The following object is masked from 'package:dplyr':
##
## slice
# - model
mFit <- clm(DaySeverity ~ .,
data = mData)
## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
## In addition: Absolute and relative convergence criteria were met
# - inspect model
summary(mFit)
## formula:
## DaySeverity ~ MeanTemperatureF + MeanDewPointF + MeanHumidity + MeanSeaLevelPressureIn + MeanVisibilityMiles + MeanWindSpeedMPH + PrecipitationIn + CloudCover + Events + WindDirDegrees + WindGust + DayofWeek
## data: mData
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 334 -395.53 847.06 5(0) 3.64e-12 4.6e+09
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## MeanTemperatureF 0.0511424 0.0700945 0.730 0.465622
## MeanDewPointF -0.0311045 0.0760710 -0.409 0.682623
## MeanHumidity 0.0627742 0.0349019 1.799 0.072083 .
## MeanSeaLevelPressureIn 0.1399800 0.5724987 0.245 0.806838
## MeanVisibilityMiles 0.0192045 0.0747324 0.257 0.797197
## MeanWindSpeedMPH 0.0819241 0.0396896 2.064 0.039006 *
## PrecipitationIn -0.1654075 0.1238633 -1.335 0.181744
## CloudCover -0.1876122 0.0900082 -2.084 0.037125 *
## EventsFog-Rain 2.3429351 1.7404577 1.346 0.178251
## EventsThunderstorm 1.1529011 1.5487647 0.744 0.456634
## EventsSnow 0.8726799 1.5059328 0.579 0.562255
## EventsRain-Snow 0.7597486 1.4504593 0.524 0.600419
## EventsFog -0.0775333 1.3813141 -0.056 0.955238
## EventsRain-Thunderstorm 0.2210704 1.3414142 0.165 0.869098
## EventsNone 0.9661348 1.3125211 0.736 0.461675
## EventsRain 0.3228195 1.2845021 0.251 0.801568
## EventsFog-Rain-Thunderstorm 0.3013201 1.2898518 0.234 0.815289
## WindDirDegrees -0.0008615 0.0011769 -0.732 0.464173
## WindGustYes 0.5221284 0.4314099 1.210 0.226170
## DayofWeekWednesday 1.5261618 0.4421710 3.452 0.000557 ***
## DayofWeekSunday 2.9207413 0.4598398 6.352 2.13e-10 ***
## DayofWeekSaturday 3.2456627 0.4608758 7.042 1.89e-12 ***
## DayofWeekTuesday 2.9690228 0.4509963 6.583 4.60e-11 ***
## DayofWeekFriday 3.1912727 0.4556060 7.004 2.48e-12 ***
## DayofWeekMonday 3.6822758 0.4602831 8.000 1.24e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 1|2 11.65 17.78 0.655
## 2|3 13.18 17.78 0.742
## 3|4 14.52 17.78 0.817
Refine:
mCoeff <- as.data.frame(summary(mFit)$coefficients)
w <- which(mCoeff$`Pr(>|z|)` < .05)
w <- setdiff(w, 1:3) # supress the intercepts from the output
rownames(mCoeff)[w]
## [1] "MeanWindSpeedMPH" "CloudCover" "DayofWeekWednesday"
## [4] "DayofWeekSunday" "DayofWeekSaturday" "DayofWeekTuesday"
## [7] "DayofWeekFriday" "DayofWeekMonday"
mData <- modelSet %>%
dplyr::select(DaySeverity, MeanWindSpeedMPH, CloudCover, DayofWeek)
# - model
mFit <- clm(DaySeverity ~ .,
data = mData)
# - inspect model
summary(mFit)
## formula: DaySeverity ~ MeanWindSpeedMPH + CloudCover + DayofWeek
## data: mData
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 334 -411.57 845.13 4(0) 3.88e-07 7.5e+03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## MeanWindSpeedMPH 0.034220 0.031119 1.100 0.271493
## CloudCover 0.004032 0.044188 0.091 0.927291
## DayofWeekWednesday 1.558539 0.427319 3.647 0.000265 ***
## DayofWeekSunday 2.787103 0.439185 6.346 2.21e-10 ***
## DayofWeekSaturday 2.973880 0.439313 6.769 1.29e-11 ***
## DayofWeekTuesday 2.806974 0.430021 6.528 6.69e-11 ***
## DayofWeekFriday 3.077950 0.436535 7.051 1.78e-12 ***
## DayofWeekMonday 3.489330 0.442451 7.886 3.11e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 1|2 1.4023 0.4004 3.502
## 2|3 2.8277 0.4213 6.712
## 3|4 4.0895 0.4385 9.326
mCoeff <- as.data.frame(summary(mFit)$coefficients)
w <- which(mCoeff$`Pr(>|z|)` < .05)
w <- setdiff(w, 1:3) # - supress the intercepts from output
important <- data.frame(predictor = rownames(mCoeff)[w],
coefficient = round(exp(mCoeff$Estimate)[w], 2),
stringsAsFactors = F)
kable(important[order(-important$coefficient), ])
| predictor | coefficient | |
|---|---|---|
| 6 | DayofWeekMonday | 32.76 |
| 5 | DayofWeekFriday | 21.71 |
| 3 | DayofWeekSaturday | 19.57 |
| 4 | DayofWeekTuesday | 16.56 |
| 2 | DayofWeekSunday | 16.23 |
| 1 | DayofWeekWednesday | 4.75 |
# - e.g. Hit rate (Accuracy of Classification)
countHits <- sum(
as.numeric(predict(mFit, type = "class")$fit) == as.numeric(modelSet$DaySeverity))
dataPoints <- length(modelSet$DaySeverity)
hitRate <- countHits/dataPoints
paste0("Correct classification rate: ", round(hitRate*100, 2), "%")
## [1] "Correct classification rate: 40.42%"
According to the Aviation Safety Network’s data, there were 16 fatal airliner accidents, and 560 airliner accidents in total in 2015. Source: Aviation Safety Network.
How many air carrier departures took place in 2015? We will use the {wbstats} package to access the World Data Bank to find out. The primary data source for this indicator is the International Civil Aviation Organization; the indicator can be accessed from World Data Bank. The indicator code in World Data Bank is: IS.AIR.DPRT (Description: Registered carrier departures worldwide are domestic takeoffs and takeoffs abroad of air carriers registered in the country.).
# Indicator: IS.AIR.DPRT
# - access World Data Bank
library(wbstats)
airDeps <- data.frame(wb(
indicator = 'IS.AIR.DPRT'
))
kable(head(airDeps))
| value | date | indicatorID | indicator | iso2c | country | |
|---|---|---|---|---|---|---|
| 2 | 1342831 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | 1A | Arab World |
| 3 | 1292064 | 2014 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | 1A | Arab World |
| 4 | 1222184 | 2013 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | 1A | Arab World |
| 5 | 1159026 | 2012 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | 1A | Arab World |
| 6 | 1093319 | 2011 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | 1A | Arab World |
| 7 | 1065038 | 2010 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | 1A | Arab World |
Not all codes in the iso2c field represent world countries, meaning: the data must be overlapping:
unique(airDeps$iso2c)
## [1] "1A" "S3" "B8" "V2" "Z4" "4E" "T4" "XC" "Z7" "7E" "T7" "EU" "F1" "XE"
## [15] "XD" "XF" "ZT" "XH" "XI" "XG" "V3" "ZJ" "XJ" "T2" "XL" "XO" "XM" "XN"
## [29] "ZQ" "XQ" "T3" "XP" "XU" "OE" "S4" "S2" "V4" "V1" "S1" "8S" "T5" "ZG"
## [43] "ZF" "T6" "XT" "1W" "AF" "AL" "DZ" "AS" "AO" "AG" "AR" "AM" "AU" "AT"
## [57] "AZ" "BS" "BH" "BD" "BB" "BY" "BE" "BZ" "BJ" "BT" "BO" "BA" "BW" "BR"
## [71] "BN" "BG" "BF" "BI" "CV" "KH" "CM" "CA" "CF" "TD" "CL" "CN" "CO" "KM"
## [85] "CD" "CG" "CR" "CI" "HR" "CU" "CY" "CZ" "DK" "DJ" "DO" "EC" "EG" "SV"
## [99] "GQ" "ER" "EE" "ET" "FJ" "FI" "FR" "GA" "GM" "GE" "DE" "GH" "GR" "GU"
## [113] "GT" "GN" "GW" "GY" "HT" "HN" "HK" "HU" "IS" "IN" "ID" "IR" "IQ" "IE"
## [127] "IL" "IT" "JM" "JP" "JO" "KZ" "KE" "KI" "KP" "KR" "KW" "KG" "LA" "LV"
## [141] "LB" "LS" "LR" "LY" "LT" "LU" "MO" "MK" "MG" "MW" "MY" "MV" "ML" "MT"
## [155] "MH" "MR" "MU" "MX" "MD" "MC" "MN" "ME" "MA" "MZ" "MM" "NA" "NR" "NP"
## [169] "NL" "NZ" "NI" "NE" "NG" "NO" "OM" "PK" "PA" "PG" "PY" "PE" "PH" "PL"
## [183] "PT" "QA" "RO" "RU" "RW" "WS" "ST" "SA" "SN" "RS" "SC" "SL" "SG" "SK"
## [197] "SI" "SB" "SO" "ZA" "ES" "LK" "SD" "SR" "SZ" "SE" "CH" "SY" "TJ" "TZ"
## [211] "TH" "TG" "TO" "TT" "TN" "TR" "TM" "UG" "UA" "AE" "GB" "US" "UY" "UZ"
## [225] "VU" "VE" "VN" "YE" "ZM" "ZW"
We need to extract only country data; load the {ISOcodes} package to access ISO3166 country codes:
library(ISOcodes)
data("ISO_3166_1")
head(ISO_3166_1$Alpha_2, 30)
## [1] "AW" "AF" "AO" "AI" "AX" "AL" "AD" "AE" "AR" "AM" "AS" "AQ" "TF" "AG"
## [15] "AU" "AT" "AZ" "BI" "BE" "BJ" "BQ" "BF" "BD" "BG" "BH" "BS" "BA" "BL"
## [29] "BY" "BZ"
Extract only country data from airDeps:
totFlights <- airDeps %>%
filter(iso2c %in% ISO_3166_1$Alpha_2, date == '2015')
print(paste0('Countries: ', length(unique(totFlights$iso2c))))
[1] “Countries: 159”
Let’s have a look at the data set now:
kable(head(totFlights, 20))
| value | date | indicatorID | indicator | iso2c | country |
|---|---|---|---|---|---|
| 23532.528 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | AF | Afghanistan |
| 0.000 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | AL | Albania |
| 68096.268 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | DZ | Algeria |
| 11759.184 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | AS | American Samoa |
| 13116.000 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | AO | Angola |
| 40046.652 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | AG | Antigua and Barbuda |
| 145583.336 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | AR | Argentina |
| 0.000 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | AM | Armenia |
| 658699.000 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | AU | Australia |
| 152056.520 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | AT | Austria |
| 18199.000 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | AZ | Azerbaijan |
| 31211.184 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | BS | Bahamas, The |
| 57444.000 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | BH | Bahrain |
| 37219.116 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | BD | Bangladesh |
| 22939.000 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | BY | Belarus |
| 138446.155 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | BE | Belgium |
| 120072.361 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | BZ | Belize |
| 1208.520 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | BJ | Benin |
| 4640.388 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | BT | Bhutan |
| 29717.430 | 2015 | IS.AIR.DPRT | Air transport, registered carrier departures worldwide | BO | Bolivia |
Good. How many flights there were in 2015?
sum(totFlights$value)
## [1] 32960402
This number corresponds with the World Data Bank report on the IS.AIR.DPRT indicator: check out.
What was the probability of boarding a flight that will end in a fatal accident in 2015? Let’s see: there were 16 flights (Aviation Safety Network data) that had a fatal outcome in 2015, and 32960402 flights in total (World Bank Data):
pAirFatal <- 16/32960402
pAirFatal
## [1] 4.85431e-07
print(paste0("Percent Fatal: ", pAirFatal*100))
## [1] "Percent Fatal: 4.85430972595541e-05"
Goran S. Milovanovic, Data Science Serbia, 01/24/2017, Belgrade, Serbia